Fix hallucinations during silence by jkarthic · Pull Request #2629 · ggml-org/whisper.cpp

jkarthic · 2024-12-14T18:00:22Z

When the predicted tokens end with a single timestamp, the entire 30-second segment should be considered done, to avoid hallucinations for the remaining part of the segment. This behaviour is on par with OpenAI's whisper. Refer to the logic related to `single_timestamp_ending` in https://github.com/openai/whisper/blob/main/whisper/transcribe.py
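
For readers who have not looked at `transcribe.py`, here is a rough Python sketch of the seek decision described above. It is a simplification for illustration, not the code in this PR or in openai/whisper: `seek_advance` is a hypothetical helper and `TIMESTAMP_BEGIN` is a made-up token id (the real value depends on the tokenizer). Only the name `single_timestamp_ending` comes from the referenced file.

```python
from typing import List

# Hypothetical value for illustration; the real id of the first timestamp
# token depends on the tokenizer/model.
TIMESTAMP_BEGIN = 1000

SEGMENT_FRAMES = 3000       # 30 s window at 10 ms per mel frame
FRAMES_PER_TIMESTAMP = 2    # one timestamp step = 20 ms = 2 mel frames


def single_timestamp_ending(tokens: List[int]) -> bool:
    """True when the last token is a timestamp and the one before it is not."""
    flags = [t >= TIMESTAMP_BEGIN for t in tokens]
    return flags[-2:] == [False, True]


def seek_advance(tokens: List[int]) -> int:
    """How many mel frames to advance after decoding one 30 s window."""
    flags = [t >= TIMESTAMP_BEGIN for t in tokens]
    # Positions just after a pair of back-to-back timestamp tokens
    # (the end of one segment followed by the start of the next).
    consecutive = [i + 1 for i in range(len(tokens) - 1)
                   if flags[i] and flags[i + 1]]

    if consecutive and not single_timestamp_ending(tokens):
        # An unfinished segment trails the last pair: resume from the
        # last closed timestamp.
        last_ts = tokens[consecutive[-1] - 1] - TIMESTAMP_BEGIN
        return last_ts * FRAMES_PER_TIMESTAMP

    # Single timestamp at the end (or no timestamp pairs at all):
    # treat the whole window as done.
    return SEGMENT_FRAMES
```

With a single trailing timestamp the function returns the full 3000 frames, so the decoder never revisits the silent tail of the window, which is where the extra segments would otherwise come from.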

itsthisjustin · 2024-12-14T18:50:29Z

We need this so bad. Hopefully it'll work with the swift package?

jkarthic · 2024-12-15T07:09:32Z

> We need this so bad. Hopefully it'll work with the swift package?

@itsthisjustin Yes, of course. The fix is in the core whisper.cpp file, so any language binding using this version/branch will have the issue fixed.

mrfragger · 2024-12-15T08:36:58Z

Gonna test this. Here is the output with 1.7.2:

[00:01:07.360 --> 00:01:07.820] Father, for all-in-sacrifice, for all-in-sacrifice, for all-in-sacrifice, for all-in-sacrifice,
[00:01:07.820 --> 00:01:08.360] for all-in-sacrifice. Yes, hold on, hold on.
[00:01:08.360 --> 00:01:12.360] Father, for all-in-sacrifice, for all-in-sacrifice, for all-in-sacrifice. Yes, hold on.
[00:01:12.360 --> 00:01:14.360] D.C. now, see what's going on.
[00:01:14.360 --> 00:01:14.360] Hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey,
[00:01:14.360 --> 00:01:14.360] hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey,
[00:01:14.360 --> 00:01:14.360] hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey,
[00:01:14.360 --> 00:01:14.360] hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey,
[00:01:14.360 --> 00:01:14.360] hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey,
[00:01:14.360 --> 00:01:14.360] hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey, hey,
[00:01:14.360 --> 00:01:14.360] hey
[00:01:14.360 --> 00:01:15.360] D.C. now, see what's going on.
[00:01:15.360 --> 00:01:16.360] D.C. now, see what's going on.
[00:01:16.360 --> 00:01:17.360] D.C. now, see what's going on.
[00:01:17.360 --> 00:01:18.360] D.C. now, see what's going on.
[00:01:18.360 --> 00:01:19.360] D.C. now, see what's going on.
[00:01:19.360 --> 00:01:20.360] D.C. now, see what's going on.
[00:01:20.360 --> 00:01:21.360] D.C. now, see what's going on.
[00:01:21.360 --> 00:01:22.360] D.C. now, see what's going on.
[00:01:22.360 --> 00:01:23.360] D.C. now, see what's going on.
[00:01:23.360 --> 00:01:24.360] D.C. now, see what's going on.
[00:01:24.360 --> 00:01:25.360] D.C. now, see what's going on.
[00:01:25.360 --> 00:01:26.360] D.C. now, see what's going on.
[00:01:26.360 --> 00:01:27.360] D.C. now, see what's going on.
[00:01:27.360 --> 00:01:28.360] D.C. now, see what's going on.
[00:01:28.360 --> 00:01:29.360] D.C. now, see what's going on.
[00:01:29.360 --> 00:01:30.360] D.C. now, see what's going on.
[00:01:30.360 --> 00:01:31.360] D.C. now, see what's going on.
[00:01:31.360 --> 00:01:32.360] D.C. now, see what's going on.
[00:01:32.360 --> 00:01:33.360] D.C. now, see what's going on.
[00:01:33.360 --> 00:01:34.360] D.C. now, see what's going on.
[00:01:34.360 --> 00:01:35.360] D.C. now, see what's going on.
[00:01:35.360 --> 00:01:36.360] D.C. now, see what's going on.
[00:01:36.360 --> 00:01:37.360] D.C. now, see what's going on.
[00:01:37.360 --> 00:01:38.360] D.C. now, see what's going on.
[00:01:38.360 --> 00:01:39.360] D.C. now, see what's going on.
[00:01:39.360 --> 00:01:40.360] D.C. now, see what's going on.
[00:01:40.360 --> 00:01:41.360] D.C. now, see what's going on.
[00:01:41.360 --> 00:01:42.360] D.C. now, see what's going on.
[00:01:42.360 --> 00:01:43.360] D.C. now, see what's going on.
[00:01:43.360 --> 00:01:44.360] D.C. now, see what's going on.
[00:01:44.360 --> 00:01:45.360] D.C. now, see what's going on.
[00:01:45.360 --> 00:01:46.360] D.C. now, see what's going on.
[00:01:46.360 --> 00:01:47.360] D.C. now, see what's going on.
[00:01:47.360 --> 00:01:48.360] D.C. now, see what's going on.
[00:01:48.360 --> 00:01:49.360] D.C. now, see what's going on.
[00:01:49.360 --> 00:01:50.360] D.C. now, see what's going on.
[00:01:50.360 --> 00:01:51.360] D.C. now, see what's going on.
[00:01:51.360 --> 00:01:52.360] D.C. now, see what's going on.
[00:01:52.360 --> 00:01:53.360] D.C. now, see what's going on.
[00:01:53.360 --> 00:01:54.360] D.C. now, see what's going on.
[00:01:54.360 --> 00:01:55.360] D.C. now, see what's going on.
[00:01:55.360 --> 00:01:56.360] D.C. now, see what's going on.
[00:01:56.360 --> 00:01:57.360] D.C. now, see what's going on.
[00:01:57.360 --> 00:01:58.360] D.C. now, see what's going on.
[00:01:58.360 --> 00:01:59.360] D.C. now, see what's going on.
[00:01:59.360 --> 00:02:00.360] D.C. now, see what's going on.
[00:02:00.360 --> 00:02:01.360] D.C. now, see what's going on.
[00:02:01.360 --> 00:02:02.360] D.C. now, see what's going on.
[00:02:02.360 --> 00:02:03.360] D.C. now, see what's going on.
[00:02:03.360 --> 00:02:04.360] D.C. now, see what's going on.
[00:02:04.360 --> 00:02:05.360] D.C. now, see what's going on.
[00:02:05.360 --> 00:02:06.360] D.C. now, see what's going on.
[00:02:06.360 --> 00:02:07.360] D.C. now, see what's going on.
[00:02:07.360 --> 00:02:08.360] D.C. now, see what's going on.
[00:02:08.360 --> 00:02:09.360] D.C. now, see what's going on.
[00:02:09.360 --> 00:02:10.360] D.C. now, see what's going on.
[00:02:10.360 --> 00:02:11.360] D.C. now, see what's going on.
[00:02:11.360 --> 00:02:12.360] D.C. now, see what's going on.
[00:02:12.360 --> 00:02:13.360] D.C. now, see what's going on.
[00:02:13.360 --> 00:02:14.360] D.C. now, see what's going on.
[00:02:14.360 --> 00:02:15.360] D.C. now, see what's going on.
[00:02:15.360 --> 00:02:16.360] D.C. now, see what's going on.
[00:02:16.360 --> 00:02:17.360] D.C. now, see what's going on.
[00:02:17.360 --> 00:02:18.360] D.C. now, see what's going on.
[00:02:18.360 --> 00:02:19.360] D.C. now, see what's going on.
[00:02:19.360 --> 00:02:20.360] D.C. now, see what's going on.
[00:02:20.360 --> 00:02:21.360] D.C. now, see what's going on.
[00:02:21.360 --> 00:02:22.360] D.C. now, see what's going on.
[00:02:22.360 --> 00:02:23.360] D.C. now, see what's going on.
[00:02:23.360 --> 00:02:24.360] D.C. now, see what's going on.
[00:02:24.360 --> 00:02:25.360] D.C. now, see what's going on.
[00:02:25.360 --> 00:02:26.360] D.C. now, see what's going on.

output_srt: saving output to '0155.srt'

Now let's see with the patch. Downloaded the new whisper.cpp into src, then ran
make clean
make -j
and got the exact same result. This is with large-v3-turbo; only large-v2_q8_0 made it not repeat. So I believe the repeating phrases are caused more by the models than by whisper.cpp. The audiobook I'm doing is 81 hrs, and I break it into 2000 audio segments to avoid long periods of hallucinations. So even a 100-hour audiobook I can get down to 3 min segments.

2000 ( 3m chapters ) = 6,000 minutes or 100 hours
1500 ( 4m chapters ) = 6,000 minutes or 100 hours
1200 ( 5m chapters ) = 6,000 minutes or 100 hours
1000 ( 6m chapters ) = 6,000 minutes or 100 hours
857 ( 7m chapters ) = 6,000 minutes or 100 hours
750 ( 8m chapters ) = 6,000 minutes or 100 hours
666 ( 9m chapters ) = 6,000 minutes or 100 hours
600 (10m chapters ) = 6,000 minutes or 100 hours

Duration of audiobook 294660 seconds
Duration of audiobook 81h:51m:00s

Total number of chapters: 187

Average length of chapters
1576 seconds or 00h:26m:16s

1625 chunks for ~182 secs or 00h:03m:02s splits
1650 chunks for ~180 secs or 00h:03m:00s splits
1675 chunks for ~177 secs or 00h:02m:57s splits
1700 chunks for ~174 secs or 00h:02m:54s splits
1725 chunks for ~172 secs or 00h:02m:52s splits
1750 chunks for ~169 secs or 00h:02m:49s splits
1775 chunks for ~167 secs or 00h:02m:47s splits
1800 chunks for ~165 secs or 00h:02m:45s splits
1825 chunks for ~162 secs or 00h:02m:42s splits
1850 chunks for ~160 secs or 00h:02m:40s splits
1875 chunks for ~158 secs or 00h:02m:38s splits
1900 chunks for ~156 secs or 00h:02m:36s splits
1925 chunks for ~154 secs or 00h:02m:34s splits
1950 chunks for ~152 secs or 00h:02m:32s splits
1975 chunks for ~150 secs or 00h:02m:30s splits
2000 chunks for ~148 secs or 00h:02m:28s splits
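
The duration figures above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic; `hms` and `split_length` are hypothetical helper names, not part of any tool used in this thread. Plain division gives about 147 s per split for 2000 chunks, slightly below the ~148 s in the listing, so the original script presumably rounds or pads differently.

```python
def hms(seconds: int) -> str:
    """Format a duration in seconds as HHh:MMm:SSs."""
    h, rem = divmod(seconds, 3600)
    m, s = divmod(rem, 60)
    return f"{h:02d}h:{m:02d}m:{s:02d}s"


def split_length(total_seconds: int, chunks: int) -> int:
    """Average split length in seconds, rounded to the nearest second."""
    return round(total_seconds / chunks)


TOTAL_SECONDS = 294660   # duration of the audiobook, from the listing above
CHAPTERS = 187           # total number of chapters, from the listing above

print(hms(TOTAL_SECONDS))                          # whole audiobook
print(hms(split_length(TOTAL_SECONDS, CHAPTERS)))  # average chapter length
print(split_length(TOTAL_SECONDS, 2000))           # seconds per split at 2000 chunks
```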

jkarthic · 2024-12-15T09:54:43Z

@mrfragger
The issue you are facing might be different from the one that I have fixed.
input_1734180845782.wav.zip
Please try the above wav file.
Here is the output with 1.7.2:
[00:00:00.000 --> 00:00:03.420] activity is like hey here's a picture of my fridge can you tell me what I'm
[00:00:03.420 --> 00:00:07.140] missing because I'm going grocery shopping and I really need to do
[00:00:07.140 --> 00:00:09.680] recipes.
[00:00:09.680 --> 00:00:11.720] you
[00:00:11.720 --> 00:00:13.780] you
[00:00:13.780 --> 00:00:15.820] you
[00:00:15.820 --> 00:00:25.820] [BLANK_AUDIO]

And here is the output with this fixed branch.
[00:00:00.000 --> 00:00:03.420] activity is like hey here's a picture of my fridge can you tell me what I'm
[00:00:03.420 --> 00:00:07.140] missing because I'm going grocery shopping and I really need to do
[00:00:07.140 --> 00:00:09.680] recipes.

Command line : ./main -t 1 -bs 1 -bo 1 -m ../../models/ggml-small.en.bin input_1734180845782.wav

Please note that the extra hallucinations are removed in this branch.
This PR doesn't try to fix any limitations in the whisper model. It just tries to bring the implementation on par with OpenAI's whisper. I noticed that OpenAI's whisper implementation doesn't produce those extra hallucinations for the attached file. When I looked for the root cause, I found this discrepancy and fixed it.

jkarthic · 2024-12-15T09:56:42Z

@mrfragger
If you can share the file you are testing with, I can run it with OpenAI's whisper implementation to see whether the issue is with the core whisper model or due to a minor bug in the whisper.cpp implementation.

mrfragger · 2024-12-15T15:27:36Z

It's a really bad audio recording of a conversation, that portion. Anyway, yeah, most of the time I eliminate all silence before compiling the audiobook to transcribe. I also trim music intros and outros if feasible. I believe your patch is addressing the silence, so if it does indeed work for that it would be a huge boon. So far I've been running your patch for the last 6 or 7 hours with no negative effects or anything unusual.

# By Georgi Gerganov (4) and others
# Via GitHub
* ggerganov/master:
  stream : improve consistency in README (ggml-org#2642)
  whisper : support no_speech_thold (ggml-org#2625)
  whisper : add single-timestamp logic (ggml-org#2629)
  readme : fix typo (ggml-org#2637)
  cmake : fix "amd64" processor string (ggml-org#2638)
  vulkan : fix soft_max.comp division by zero (ggml-org#2633)
  common : add cstdio header
  stream : update build instructions
  android : fix build and ci (ggml-org#2624)
  models : fix typo in download-ggml-model.sh (ggml-org#2623)
  ruby : Sync whisper.cpp and model download feature (ggml-org#2617)
  scripts : update to new build system
# Conflicts:
# src/whisper.cpp

When a specific language is forced (e.g. -l ru, -l es) and a 30-second decoder window is entirely zero-valued, whisper emits language-specific fallback tokens (bracketed music tags like [Música], fake subtitle-editor credits on -l ru). The auto-detect path handles silent chunks naturally.

Add a chunk-level zero-PCM check at the top of the seek loop inside whisper_full_with_state. When the current window is all-zero and the caller forced a language, emit a single [BLANK_AUDIO] segment for that chunk and advance without running the encoder or decoder.

Matches the approach endorsed in PR ggml-org#1588 review ("skip entire segments when silence is detected"), using zero-PCM as a stricter and language-independent signal than no_speech_prob.

The caller's original language intent is captured before the auto-detect block overwrites params.language, so the guard only fires when the user explicitly requested a specific language; auto-detect paths are unchanged.

Fixes ggml-org#1724 (residual hallucination on forced-language silence chunks not addressed by ggml-org#2629)
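
As a rough illustration of the zero-PCM guard described in that commit message, the check could look like the Python sketch below. This is not the whisper.cpp code: `is_zero_chunk` and `plan_chunks` are hypothetical names, and the fixed 30-second stepping is a simplifying assumption, since the real seek loop advances by a variable amount.

```python
from typing import List, Sequence

SAMPLE_RATE = 16000                 # whisper operates on 16 kHz audio
CHUNK_SAMPLES = 30 * SAMPLE_RATE    # one 30-second window


def is_zero_chunk(pcm: Sequence[float], start: int,
                  chunk_samples: int = CHUNK_SAMPLES) -> bool:
    """True when every sample in the window starting at `start` is exactly zero."""
    window = pcm[start:start + chunk_samples]
    return len(window) > 0 and all(s == 0.0 for s in window)


def plan_chunks(pcm: Sequence[float], forced_language: bool,
                chunk_samples: int = CHUNK_SAMPLES) -> List[str]:
    """Decide per window whether to decode or emit a blank segment.

    Assumption: fixed-size steps, unlike the variable seek in the real loop.
    """
    actions = []
    for start in range(0, len(pcm), chunk_samples):
        if forced_language and is_zero_chunk(pcm, start, chunk_samples):
            # Skip encoder and decoder entirely for this window.
            actions.append("[BLANK_AUDIO]")
        else:
            actions.append("decode")
    return actions
```

Note that the guard only applies when a language was forced; with auto-detect, every window is still decoded as before.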

ggerganov approved these changes Dec 15, 2024

Comment thread src/whisper.cpp Outdated

Accept review comments related to formatting.

2fe659b

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

ggerganov merged commit 2f2841b into ggml-org:master Dec 17, 2024

achyutbenz19 mentioned this pull request Apr 19, 2026

whisper : skip decoding of zero-filled chunks on forced-language path (#1724) #3763

Closed

ggerganov merged 2 commits into ggml-org:master from highlight-ing:fix_hallucinations
